Description

Field of the Invention

[0001] The present invention relates generally to the field of gene expression and specifically to a method for the
serial analysis of gene expression (SAGE) for the analysis of a large number of transcripts by production of ditag
oligonucleotides comprising at least two defined nucleotide sequence tags, wherein the defined nucleotide sequence
tags comprise a defined region of a transcript which corresponds to a region of an expressed gene.

Background of the Invention

[0002] Determination of the genomic sequence of higher organisms, including humans, is now a real and attainable
goal. However, this analysis only represents one level of genetic complexity. The ordered and timely expression of 
genes represents another level of complexity equally important to the definition and biology of the organism. 

[0003] The role of sequencing complementary DNA (cDNA), reverse transcribed from mRNA, as part of the human
genome project has been debated, as proponents of genomic sequencing have argued the difficulty of finding every
mRNA expressed in all tissues, cell types, and developmental stages and have pointed out that much valuable
information from intronic and intergenic regions, including control and regulatory sequences, will be missed by cDNA
sequencing (Report of the Committee on Mapping and Sequencing the Human Genome, National Academy Press,
Washington, D.C., 1988). Sequencing of transcribed regions of the genome using cDNA libraries has heretofore been
considered unsatisfactory. Libraries of cDNA are believed to be dominated by repetitive elements, mitochondrial genes,
ribosomal RNA genes, and other nuclear genes comprising common or housekeeping sequences. It is believed that
cDNA libraries do not provide all sequences corresponding to structural and regulatory polypeptides or peptides
(Putney, et al., Nature, 302:718, 1983).

[0004] Another drawback of standard cDNA cloning is that some mRNAs are abundant while others are rare. The
cellular quantities of mRNA from various genes can vary by several orders of magnitude. 

[0005] Techniques based on cDNA subtraction or differential display can be quite useful for comparing gene
expression differences between two cell types (Hedrick, et al., Nature, 308:149, 1984; Liang and Pardee, Science,
257:967, 1992), but provide only a partial analysis, with no direct information regarding abundance of messenger RNA.
The expressed sequence tag (EST) approach has been shown to be a valuable tool for gene discovery (Adams, et al.,
Science, 252:1656, 1991; Adams, et al., Nature, 355:632, 1992; Okubo, et al., Nature Genetics, 2:173, 1992), but, like
Northern blotting, RNase protection, and reverse transcriptase-polymerase chain reaction (RT-PCR) analysis (Alwine,
et al., Proc. Natl. Acad. Sci. USA, 74:5350, 1977; Zinn, et al., Cell, 34:865, 1983; Veres, et al., Science, 237:415,
1987), only evaluates a limited number of genes at a time. In addition, the EST approach preferably employs nucleotide
sequences of 150 base pairs or longer for similarity searches and mapping.

[0006] Sequence tagged sites (STSs) (Olson, et al., Science, 245:1434, 1989) have also been utilized to identify
genomic markers for the physical mapping of the genome. These short sequences from physically mapped clones
represent uniquely identified map positions in the genome. In contrast, the identification of expressed genes relies on
expressed sequence tags which are markers for those genes actually transcribed and expressed in vivo.

[0007] There is a need for an improved method which allows rapid, detailed analysis of thousands of expressed
genes for the investigation of a variety of biological applications, particularly for establishing the overall pattern of gene
expression in different cell types or in the same cell type under different physiologic or pathologic conditions.
Identification of different patterns of expression has several utilities, including the identification of appropriate therapeutic
targets, candidate genes for gene therapy (e.g., gene replacement), tissue typing, forensic identification, mapping
locations of disease-associated genes, and the identification of diagnostic and prognostic indicator genes.

SUMMARY OF THE INVENTION 

[0008] The present invention provides a method for the rapid analysis of numerous transcripts in order to identify
the overall pattern of gene expression in different cell types or in the same cell type under different physiologic,
developmental or disease conditions. The method is based on the identification of a short nucleotide sequence tag at a
defined position in a messenger RNA. The tag is used to identify the corresponding transcript and gene from which it
was transcribed. By utilizing dimerized tags, each termed a "ditag", the method of the invention allows elimination of certain
types of bias which might occur during cloning and/or amplification and possibly during data evaluation. Concatenation
of these short nucleotide sequence tags allows the efficient analysis of transcripts in a serial manner by sequencing
multiple tags on a single DNA molecule, for example, a DNA molecule inserted in a vector or in a single clone.

[0009] The method described herein is the serial analysis of gene expression (SAGE), a novel approach which allows
the analysis of a large number of transcripts. To demonstrate this strategy, short cDNA sequence tags were generated
from mRNA isolated from pancreas, randomly paired to form ditags, concatenated, and cloned. Manual sequencing
of 1,000 tags revealed a gene expression pattern characteristic of pancreatic function. Identification of such patterns
is important diagnostically and therapeutically, for example. Moreover, the use of SAGE as a gene discovery tool was
documented by the identification and isolation of new pancreatic transcripts corresponding to novel tags. SAGE
provides a broadly applicable means for the quantitative cataloging and comparison of expressed genes in a variety of
normal, developmental, and disease states.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0010] FIGURE 1 shows a schematic of SAGE. The first restriction enzyme, or anchoring enzyme, is NlaIII and the
second enzyme, or tagging enzyme, is FokI in this example. Sequences represent primer derived sequences, and
transcript derived sequences with "X" and "O" representing nucleotides of different tags.

[0011] FIGURE 2 shows a comparison of transcript abundance. Bars represent the percent abundance as determined
by SAGE (dark bars) or hybridization analysis (light bars). SAGE quantitations were derived from Table 1 as follows:
TRY1/2 includes the tags for trypsinogen 1 and 2, PROCAR indicates tags for procarboxypeptidase A1, CHYMO
indicates tags for chymotrypsinogen, and ELA/PRO includes the tags for elastase IIIB and protease E. Error bars
represent the standard deviation determined by taking the square root of counted events and converting it to a percent
abundance (assumed Poisson distribution).
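
The error-bar calculation described for FIGURE 2 can be sketched as follows. This is a minimal illustration, not the procedure used to prepare the figure: the function name and the example counts (100 tags out of 1,000) are hypothetical, and the only rule taken from the text is that the standard deviation of a count is its square root, expressed as a percentage of the total.

```python
import math


def percent_abundance_with_error(tag_count, total_tags):
    """Return (percent abundance, one standard deviation in percent).

    Assumes tag counts follow a Poisson distribution, so the standard
    deviation of a count is the square root of that count.
    """
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    abundance = 100.0 * tag_count / total_tags
    std_dev = 100.0 * math.sqrt(tag_count) / total_tags
    return abundance, std_dev


# Hypothetical example: a tag seen 100 times among 1,000 sequenced tags.
abundance, std_dev = percent_abundance_with_error(100, 1000)
print(abundance, std_dev)
```

Because the relative error shrinks as the count grows, rare tags carry proportionally wider error bars than abundant ones.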

[0012] FIGURE 3 shows the results of screening a cDNA library with SAGE tags. P1 and P2 show typical hybridization
results obtained with 13 bp oligonucleotides as described in the Examples. P1 and P2 correspond to the transcripts
described in Table 2. Images were obtained using a Molecular Dynamics PhosphorImager and the circle indicates the
outline of the filter membrane to which the recombinant phage were transferred prior to hybridization.
[0013] FIGURE 4 is a block diagram of a tag code database access system in accordance with the present invention. 

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

[0014] The present invention provides a rapid, quantitative process for determining the abundance and nature of
transcripts corresponding to expressed genes. The method, termed serial analysis of gene expression (SAGE), is
based on the identification and characterization of partial, defined sequences of transcripts corresponding to gene
segments. These defined transcript sequence "tags" are markers for genes which are expressed in a cell, a tissue, or
an extract, for example. 

[0015] SAGE is based on several principles. First, a short nucleotide sequence tag (9 to 10 bp) contains sufficient
information content to uniquely identify a transcript provided it is isolated from a defined position within the transcript.
For example, a sequence as short as 9 bp can distinguish 262,144 transcripts (4^9) given a random nucleotide distribution
at the tag site, whereas estimates suggest that the human genome encodes about 60,000 to 200,000 transcripts (Fields,
et al., Nature Genetics, 7:345, 1994). The size of the tag can be shorter for lower eukaryotes or prokaryotes, for example,
where the number of transcripts encoded by the genome is lower. For example, a tag as short as 6-7 bp may be
sufficient for distinguishing transcripts in yeast.
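
The capacity argument above is simple arithmetic and can be checked with a short sketch. The function names are illustrative, and the calculation assumes, as the text does, a random nucleotide distribution at the tag site; it gives only the smallest tag length whose sequence space is at least as large as the number of transcripts, not a guarantee that every transcript receives a distinct tag.

```python
def distinct_tags(tag_length):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** tag_length


def min_tag_length(transcript_count):
    """Smallest tag length whose sequence space covers transcript_count."""
    length = 1
    while distinct_tags(length) < transcript_count:
        length += 1
    return length


print(distinct_tags(9))          # 262144
print(min_tag_length(200_000))   # 9
print(min_tag_length(60_000))    # 8
```

Under these assumptions a 9 bp tag covers the upper estimate of 200,000 transcripts, which is consistent with the 9 to 10 bp tag length given above.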

[0016] Second, random dimerization of tags allows a procedure for reducing bias (caused by amplification and/or 
cloning). Third, concatenation of these short sequence tags allows the efficient analysis of transcripts in a serial manner 
by sequencing multiple tags within a single vector or clone. As with serial communication by computers, wherein
information is transmitted as a continuous string of data, serial analysis of the sequence tags requires a means to establish
the register and boundaries of each tag. Dimerized tags may be applied with or without concatenation, or in combination 
with other known methods of sequence identification. 

[0017] In a first embodiment, the invention provides a method for the detection of gene expression in a particular
cell or tissue, or cell extract, for example, including at a particular developmental stage or in a particular disease state.
The method comprises producing complementary deoxyribonucleic acid (cDNA) oligonucleotides, isolating a first
defined nucleotide sequence tag from a first cDNA oligonucleotide and a second defined nucleotide sequence tag from
a second cDNA oligonucleotide, linking the first tag to a first oligonucleotide linker, wherein the first oligonucleotide
linker comprises a first sequence for hybridization of an amplification primer and linking the second tag to a second
oligonucleotide linker, wherein the second oligonucleotide linker comprises a second sequence for hybridization of an
amplification primer, and determining the nucleotide sequence of the tag(s), wherein the tag(s) correspond to an
expressed gene.

[0018] Figure 1 shows a schematic representation of the analysis of messenger RNA (mRNA) using SAGE as
described in the method of the invention. mRNA is isolated from a cell or tissue of interest for in vitro synthesis of a
double-stranded DNA sequence by reverse transcription of the mRNA. The double-stranded DNA complement of mRNA
formed is referred to as complementary DNA (cDNA).

[0019] The term "oligonucleotide" as used herein refers to primers or oligomer fragments comprised of two or more
deoxyribonucleotides or ribonucleotides, preferably more than three. The exact size will depend on many factors, which
in turn depend on the ultimate function or use of the oligonucleotide.

[0020] The method further includes ligating the first tag linked to the first oligonucleotide linker to the second tag
linked to the second oligonucleotide linker and forming a "ditag". Each ditag represents two defined nucleotide
sequences of at least one transcript, representative of at least one gene. Typically, a ditag represents two transcripts
from two distinct genes. The presence of a defined cDNA tag within the ditag is indicative of expression of a gene
having a sequence of that tag.

[0021] The analysis of ditags, formed prior to any amplification step, provides a means to eliminate potential
distortions introduced by amplification, e.g., PCR. The pairing of tags for the formation of ditags is a random event. The
number of different tags is expected to be large; therefore, the probability of any two tags being coupled in the same
ditag is small, even for abundant transcripts. Therefore, repeated ditags potentially produced by biased standard
amplification and/or cloning methods are excluded from analysis by the method of the invention.
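
The exclusion of repeated ditags can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the data-evaluation procedure of the invention: it assumes each ditag is given as a plain string of two tags joined tail to tail with linker and anchoring-site sequence already removed, assumes a fixed tag length of 9 bp, and treats a ditag and its reverse complement as the same molecule. All names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_tags(ditags, tag_length=9):
    """Count tags, using each distinct ditag only once."""
    counts = Counter()
    seen = set()
    for ditag in ditags:
        # The same ditag may be read from either strand, so use one
        # canonical orientation as the key for duplicate detection.
        key = min(ditag, reverse_complement(ditag))
        if key in seen:
            continue  # repeated ditag: likely an amplification artifact
        seen.add(key)
        # Left tag reads directly; right tag is on the opposite strand.
        counts[ditag[:tag_length]] += 1
        counts[reverse_complement(ditag[-tag_length:])] += 1
    return counts


ditags = [
    "ACGTACGTAGGGGGGGGG",
    "ACGTACGTAGGGGGGGGG",  # exact repeat, counted once
    "ACGTACGTATTTTTTTTT",
]
print(count_tags(ditags))
```

An abundant tag still accumulates counts because it pairs with many different partners, whereas an over-amplified ditag contributes only once.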

[0022] The term "defined" nucleotide sequence, or "defined" nucleotide sequence tag, refers to a nucleotide
sequence derived from either the 5' or 3' terminus of a transcript. The sequence is defined by cleavage with a first restriction
endonuclease, and represents nucleotides either 5' or 3' of the first restriction endonuclease site, depending on which
terminus is used for capture (e.g., 3' when oligo-dT is used for capture as described herein).
[0023] As used herein, the terms "restriction endonucleases" and "restriction enzymes" refer to bacterial enzymes 
which bind to a specific double-stranded DNA sequence termed a recognition site or recognition nucleotide sequence, 
and cut double-stranded DNA at or near the specific recognition site. 

[0024] The first endonuclease, termed "anchoring enzyme" or "AE" in Figure 1, is selected by its ability to cleave a
transcript at least one time and therefore produce a defined sequence tag from either the 5' or 3' end of a transcript.
Preferably, a restriction endonuclease having at least one recognition site and therefore having the ability to cleave a
majority of cDNAs is utilized. For example, as illustrated herein, enzymes which have a 4 base pair recognition site
are expected to cleave every 256 base pairs (4^4) on average while most transcripts are considerably larger. Restriction
endonucleases which recognize a 4 base pair site include NlaIII, as exemplified in the EXAMPLES of the present
invention. Other similar endonucleases having at least one recognition site within a DNA molecule (e.g., cDNA) will be
known to those of skill in the art (see for example, Current Protocols in Molecular Biology, Vol. 2, 1995, Ed. Ausubel,
et al., Greene Publish. Assoc. & Wiley Interscience, Unit 3.1.15; New England Biolabs Catalog, 1995).
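
The spacing estimate for a 4 base pair recognition site, and the reason such an enzyme cleaves most cDNAs, can be illustrated with a rough calculation. The sketch assumes a random nucleotide distribution and treats each position as an independent trial, which is only an approximation; the function names and the 1,500 bp transcript length are hypothetical.

```python
def expected_site_spacing(site_length):
    """Average distance in bp between occurrences of a specific site."""
    return 4 ** site_length


def probability_cleaved(transcript_length, site_length=4):
    """Approximate chance that a random sequence contains the site."""
    positions = transcript_length - site_length + 1
    if positions <= 0:
        return 0.0
    p_site = 1.0 / 4 ** site_length
    # Independence between overlapping positions is an approximation.
    return 1.0 - (1.0 - p_site) ** positions


print(expected_site_spacing(4))    # 256
print(expected_site_spacing(8))    # 65536
print(probability_cleaved(1500))   # close to 1 for a 1,500 bp cDNA
```

The same arithmetic shows why an enzyme with a 7 to 8 bp recognition site would cut cDNA only rarely.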

[0025] After cleavage with the anchoring enzyme, the most 5' or 3' region of the cleaved cDNA can then be isolated
by binding to a capture medium. For example, as illustrated in the present EXAMPLES, streptavidin beads are used
to isolate the defined 3' nucleotide sequence tag when the oligo dT primer for cDNA synthesis is biotinylated. In this
example, cleavage with the first or anchoring enzyme provides a unique site on each transcript which corresponds to
the restriction site located closest to the poly-A tail. Likewise, the 5' cap of a transcript (the cDNA) can be utilized for
labeling or binding a capture means for isolation of a 5' defined nucleotide sequence tag. Those of skill in the art will
know other similar capture systems (e.g., biotin/streptavidin, digoxigenin/anti-digoxigenin) for isolation of the defined
sequence tag as described herein.

[0026] The invention is not limited to use of a single "anchoring" or first restriction endonuclease. It may be desirable
to perform the method of the invention sequentially, using different enzymes on separate samples of a preparation, in
order to identify a complete pattern of transcription for a cell or tissue. In addition, the use of more than one anchoring
enzyme provides confirmation of the expression pattern obtained from the first anchoring enzyme. Therefore, it is also
envisioned that the first or anchoring endonuclease may rarely cut cDNA such that few or no cDNA representing
abundant transcripts are cleaved. Thus, transcripts which are cleaved represent "unique" transcripts. Restriction
enzymes that have a 7-8 bp recognition site, for example, would be enzymes that would rarely cut cDNA. Similarly, more
than one tagging enzyme, described below, can be utilized in order to identify a complete pattern of transcription.

[0027] The term "isolated" as used herein includes polynucleotides substantially free of other nucleic acids, proteins,
lipids, carbohydrates or other materials with which it is naturally associated. cDNA is not naturally occurring as such,
but rather is obtained via manipulation of a partially purified naturally occurring mRNA. Isolation of a defined sequence
tag refers to the purification of the 5' or 3' tag from other cleaved cDNA.

[0028] In one embodiment, the isolated defined nucleotide sequence tags are separated into two pools of cDNA,
when the linkers have different sequences. Each pool is ligated via the anchoring, or first restriction endonuclease site
to one of two linkers. When the linkers have the same sequence, it is not necessary to separate the tags into pools.
The first oligonucleotide linker comprises a first sequence for hybridization of an amplification primer and the second
oligonucleotide linker comprises a second sequence for hybridization of an amplification primer. In addition, the linkers
further comprise a second restriction endonuclease site, also termed the "tagging enzyme" or "TE". The method of the
invention does not require, but preferably comprises amplifying the ditag oligonucleotide after ligation.

[0029] The second restriction endonuclease cleaves at a site distant from or outside of the recognition site. For 
example, the second restriction endonuclease can be a type IIS restriction enzyme. Type IIS restriction endonucleases 
cleave at a defined distance up to 20 bp away from their asymmetric recognition sites (Szybalski, W., Gene, 40:169,
1985). Examples of type IIS restriction endonucleases include BsmFI and FokI. Other similar enzymes will be known
to those of skill in the art (see, Current Protocols in Molecular Biology, supra).

[0030] The first and second "linkers" which are ligated to the defined nucleotide sequence tags are oligonucleotides
having the same or different nucleotide sequences. For example, the linkers illustrated in the Examples of the present
invention include linkers having different sequences:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
(SEQ ID NO:1)

3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'
(SEQ ID NO:2)

and

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
(SEQ ID NO:3)

3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'
(SEQ ID NO:4),

wherein A is a dideoxy nucleotide (e.g., dideoxy A). Other similar linkers can be utilized in the method of the invention;
those of skill in the art can design such alternate linkers.

[0031] The linkers are designed so that cleavage of the ligation products with the second restriction enzyme, or
tagging enzyme, results in release of the linker having a defined nucleotide sequence tag (e.g., 3' of the restriction
endonuclease cleavage site as exemplified herein). The defined nucleotide sequence tag may be from about 6 to 30
base pairs. Preferably, the tag is about 9 to 11 base pairs. Therefore, a ditag is from about 12 to 60 base pairs, and
preferably from 18 to 22 base pairs.

[0032] The pool of defined tags ligated to linkers having the same sequence, or the two pools of defined nucleotide
sequence tags ligated to linkers having different nucleotide sequences, are randomly ligated to each other "tail to tail".
The portion of the cDNA tag furthest from the linker is referred to as the "tail". As illustrated in FIGURE 1, the ligated
tag pair, or ditag, has a first restriction endonuclease site upstream (5') and a first restriction endonuclease site
downstream (3') of the ditag; a second restriction endonuclease cleavage site upstream and downstream of the ditag, and
a linker oligonucleotide containing both a second restriction enzyme recognition site and an amplification primer
hybridization site upstream and downstream of the ditag. In other words, the ditag is flanked by the first restriction
endonuclease site, the second restriction endonuclease cleavage site and the linkers, respectively.

[0033] The ditag can be amplified by utilizing primers which specifically hybridize to one strand of each linker.
Preferably, the amplification is performed by standard polymerase chain reaction (PCR) methods as described (U.S. Patent
No. 4,683,195). Alternatively, the ditags can be amplified by cloning in prokaryotic-compatible vectors or by other
amplification methods known to those of skill in the art.

[0034] The term "primer" as used herein refers to an oligonucleotide, whether occurring naturally or produced
synthetically, which is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis
of primer extension product which is complementary to a nucleic acid strand is induced, i.e., in the presence of
nucleotides and an agent for polymerization such as DNA polymerase and at a suitable temperature and pH. The primer is
preferably single stranded for maximum efficiency in amplification. Preferably, the primer is an
oligodeoxyribonucleotide. The primer must be sufficiently long to prime the synthesis of extension products in the presence of the agent
for polymerization. The exact lengths of the primers will depend on many factors, including temperature and source of
primer.

[0035] The primers herein are selected to be "substantially" complementary to the different strands of each specific
sequence to be amplified. This means that the primers must be sufficiently complementary to hybridize with their
respective strands. Therefore, the primer sequence need not reflect the exact sequence of the template. In the present
invention, the primers are substantially complementary to the oligonucleotide linkers.

[0036] Primers useful for amplification of the linkers exemplified herein as SEQ ID NO:1-4 include
5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5) and 5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6). Those of skill in the art
can prepare similar primers for amplification based on the nucleotide sequence of the linkers without undue
experimentation.

[0037] Cleavage of the amplified PCR product with the first restriction endonuclease allows isolation of ditags which
can be concatenated by ligation. After ligation, it may be desirable to clone the concatemers, although it is not required
in the method of the invention. Analysis of the ditags or concatemers, whether or not amplification was performed, is
by standard sequencing methods. Concatemers generally consist of about 2 to 200 ditags and preferably from about
8 to 20 ditags. While these are preferred concatemers, it will be apparent that the number of ditags which can be
concatenated will depend on the length of the individual tags and can be readily determined by those of skill in the art
without undue experimentation. After formation of concatemers, multiple tags can be cloned into a vector for sequence
analysis, or alternatively, ditags or concatemers can be directly sequenced without cloning by methods known to those
of skill in the art.

[0038] Among the standard procedures for cloning the defined nucleotide sequence tags of the invention is insertion
of the tags into vectors such as plasmids or phage. The ditag or concatemers of ditags produced by the method
described herein are cloned into recombinant vectors for further analysis, e.g., sequence analysis, plaque/plasmid
hybridization using the tags as probes, by methods known to those of skill in the art.

[0039] The term "recombinant vector" refers to a plasmid, virus or other vehicle known in the art that has been
manipulated by insertion or incorporation of the ditag genetic sequences. Such vectors contain a promoter sequence
which facilitates the efficient transcription of a marker genetic sequence, for example. The vector typically contains
an origin of replication, a promoter, as well as specific genes which allow phenotypic selection of the transformed cells.
Vectors suitable for use in the present invention include, for example, pBlueScript (Stratagene, La Jolla, CA); pBC,
pSL301 (Invitrogen) and other similar vectors known to those of skill in the art. Preferably, the ditags or concatemers
thereof are ligated into a vector for sequencing purposes.

[0040] Vectors in which the ditags are cloned can be transferred into a suitable host cell. "Host cells" are cells in
which a vector can be propagated and its DNA expressed. The term also includes any progeny of the subject host cell.
It is understood that all progeny may not be identical to the parental cell since there may be mutations that occur during
replication. However, such progeny are included when the term "host cell" is used. Methods of stable transfer, meaning
that the foreign DNA is continuously maintained in the host, are known in the art.

[0041] Transformation of a host cell with a vector containing ditag(s) may be carried out by conventional techniques
as are well known to those skilled in the art. Where the host is prokaryotic, such as E. coli, competent cells which are
capable of DNA uptake can be prepared from cells harvested after exponential growth phase and subsequently treated
by the CaCl2 method using procedures well known in the art. Alternatively, MgCl2 or RbCl can be used. Transformation
can also be performed by electroporation or other commonly used methods in the art.

[0042] The ditags in a particular clone can be sequenced by standard methods (see for example, Current Protocols
in Molecular Biology, supra, Unit 7) either manually or using automated methods.

[0043] In another embodiment, the present invention provides a kit useful for detection of gene expression wherein
the presence of a ditag is indicative of expression of a gene having a sequence of the tag, the kit comprising one or
more containers comprising a first container containing a first oligonucleotide linker having a first sequence useful for
hybridization of an amplification primer; a second container containing a second oligonucleotide linker having a second
sequence useful for hybridization of an amplification primer, wherein the linkers further comprise a restriction
endonuclease site for cleavage of DNA at a site distant from the restriction endonuclease recognition site; and a third and
fourth container having nucleic acid primers for hybridization to the first and second unique sequence of the linker.
It is apparent that if the oligonucleotide linkers comprise the same nucleotide sequence, only one container containing
linkers is necessary in the kit of the invention.

[0044] In yet another embodiment, the invention provides an oligonucleotide composition having at least two defined
nucleotide sequence tags, wherein the defined nucleotide sequence tags comprise sequence 5' of a 5'-most cleavage
site of a restriction endonuclease or 3' of a 3'-most cleavage site of a restriction endonuclease in a full length cDNA,
wherein at least one of the sequence tags corresponds to at least one expressed gene. The composition consists of
about 1 to 200 ditags, and preferably about 8 to 20 ditags. Such compositions are useful for the analysis of gene
expression by identifying the defined nucleotide sequence tag corresponding to an expressed gene in a cell, tissue or
cell extract, for example.

[0045] It is envisioned that the identification of differentially expressed genes using the SAGE technique of the
invention can be used in combination with other genomics techniques. For example, ditags can be hybridized with
oligonucleotides immobilized on a solid support (e.g., nitrocellulose filter, glass slide, silicon chip). Such techniques
include "parallel sequence analysis" or PSA, as described below. The sequence of ditags formed by the method of the
invention can also be determined using limiting dilutions by methods including clonal sequencing (CS).

[0046] Briefly, PSA is performed after ditag preparation, wherein the oligonucleotide sequences to which the ditags
are hybridized are preferably unlabeled and the ditag is preferably detectably labeled. Alternatively, the oligonucleotide
can be labeled rather than the ditag. The ditags can be detectably labeled, for example, with a radioisotope, a
fluorescent compound, a bioluminescent compound, a chemiluminescent compound, a metal chelator, or an enzyme.
Those of ordinary skill in the art will know of other suitable labels for binding to the ditag, or will be able to ascertain
such, using routine experimentation. For example, PCR can be performed with labeled (e.g., fluorescein tagged)
primers. Preferably, the ditag contains a fluorescent end label.

[0047] The labeled or unlabeled ditags are separated into single-stranded molecules which are preferably serially
diluted and added to a solid support (e.g., a silicon chip as described by Fodor, et al., Science, 251:767, 1991) containing
oligonucleotides representing, for example, every possible permutation of a 10-mer (e.g., in each grid of a chip). The
solid support is then used to determine differential expression of the tags contained within the support (e.g., on a grid
on a chip) by hybridization of the oligonucleotides on the solid support with tags produced from cells under different
conditions (e.g., different stage of development, growth of cells in the absence and presence of a growth factor, normal
versus transformed cells, comparison of different tissue expression, etc.). In the case of fluoresceinated end labeled
ditags, analysis of fluorescence is indicative of hybridization to a particular 10-mer. When the immobilized
oligonucleotide is fluoresceinated, for example, a loss of fluorescence due to quenching (by the proximity of the hybridized ditag
to the labeled oligo) is observed and is analyzed for the pattern of gene expression. An illustrative example of the
method is shown in Example 4 herein.

[0048] The SAGE method of the invention is also useful for clonal sequencing, similar to limiting dilution techniques
used in cloning of cell lines. For example, ditags or concatemers thereof are diluted and added to individual receptacles
such that each receptacle contains less than one DNA molecule per receptacle. DNA in each receptacle is amplified
and sequenced by standard methods known in the art, including mass spectroscopy. Assessment of differential
expression is performed as described above for SAGE.

[0049] Those of skill in the art can readily determine other methods of analysis for ditags produced by SAGE as
described in the present invention, without resorting to undue experimentation. 

[0050] The concept of deriving a defined tag from a sequence in accordance with the present invention is useful in
matching tags of samples to a sequence database. In the preferred embodiment, a computer method is used to match
a sample sequence with known sequences.

[0051] In one embodiment, a sequence tag for a sample is compared to corresponding information in a sequence
database to identify known sequences that match the sample sequence. One or more tags can be determined for each
sequence in the sequence database as the N base pairs adjacent to each anchoring enzyme site within the sequence.
However, in the preferred embodiment, only the first anchoring enzyme site from the 3' end is used to determine a tag.
In the preferred embodiment, the adjacent base pairs defining a tag are on the 3' side of the anchoring enzyme site,
and N is preferably 9.
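
Tag determination for a database sequence, as described above, can be sketched as follows. This is an illustrative assumption-laden example rather than the implementation of the invention: the function name is hypothetical, the anchoring site defaults to CATG (the NlaIII recognition sequence), the input is assumed to be the sense strand written 5' to 3', and a site lying fewer than N bases from the 3' end is simply reported as yielding no tag.

```python
def extract_tag(sequence, anchor_site="CATG", tag_length=9):
    """Return the N bases on the 3' side of the 3'-most anchoring site.

    Returns None if the site is absent or fewer than N bases follow it.
    """
    sequence = sequence.upper()
    position = sequence.rfind(anchor_site)  # first site from the 3' end
    if position == -1:
        return None
    start = position + len(anchor_site)
    tag = sequence[start:start + tag_length]
    if len(tag) < tag_length:
        return None
    return tag


# Two CATG sites; only the one nearest the 3' end defines the tag.
print(extract_tag("GGCATGAAACCCGGGTTTCATGACGTACGTAGG"))  # ACGTACGTA
```

Using only the 3'-most site gives one tag per database sequence, which mirrors the experimental capture of the 3' fragment on oligo-dT.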

[0052] A linear search through such a database may be used. However, in the preferred embodiment, a sequence
tag from a sample is converted to a unique numeric representation by converting each base pair (A, C, G, or T) of an
N-base tag to a number or "tag code" (e.g., A=0, C=1, G=2, T=3, or any other suitable mapping). A tag is determined
for each sequence of a sequence database as described above, and the tag is converted to a tag code in a similar
manner. In the preferred embodiment, a set of tag codes for a sequence database is stored in a pointer file. The tag
code for a sample sequence is compared to the tag codes in the pointer file to determine the location in the sequence
database of the sequence corresponding to the sample tag code. (Multiple corresponding sequences may exist if the
sequence database has redundancies.)
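
A minimal sketch of the tag code conversion and of a pointer structure keyed by tag code follows. It assumes the example mapping A=0, C=1, G=2, T=3 and reads the digits as a base-4 number; the names tag_to_code and build_pointer_file, the use of a dictionary, and the sample tags are hypothetical and are not taken from the SAGE software.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # example mapping from the text


def tag_to_code(tag):
    """Convert an N-base tag to an integer by reading it as a base-4 number."""
    code = 0
    for base in tag.upper():
        code = code * 4 + BASE_CODE[base]
    return code


def build_pointer_file(record_tags):
    """Map each tag code to the record numbers whose tag has that code.

    record_tags[i] is the tag determined for database record i. A list of
    record numbers is kept per code because a redundant database can hold
    several records with the same tag.
    """
    pointers = defaultdict(list)
    for record_number, tag in enumerate(record_tags):
        pointers[tag_to_code(tag)].append(record_number)
    return pointers


# Illustrative tags only; record numbers are positions in this list.
record_tags = ["GAGCACACC", "TTCTGTGTG", "GAGCACACC"]
pointer_file = build_pointer_file(record_tags)
print(pointer_file[tag_to_code("GAGCACACC")])
```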

[0053] FIGURE 4 is a block diagram of a tag code database access system in accordance with the present invention.
A sequence database 10 (e.g., the Human Genome Sequence Database) is processed as described above, such that
each sequence has a tag code determined and stored in a pointer file 12. A sample tag code X for a sample is determined
as described above, and stored within a memory location 14 of a computer. The sample tag code X is compared to
the pointer file 12 for a matching sequence tag code. If a match is found, a pointer associated with the matching
sequence tag code is used to access the corresponding sequence in the sequence database 10.

[0054] The pointer file 12 may be in any of several formats. In one format, each entry of the pointer file 12 comprises
a tag code and a pointer to a corresponding record in the sequence database 10. The sample tag code X can be
compared to sequence tag codes in a linear search. Alternatively, the sequence tag codes can be sorted and a binary
search used. As another alternative, the sequence tag codes can be structured in a hierarchical tree structure (e.g., a
B-tree), or as a singly or doubly linked list, or in any other conveniently searchable data structure or format.
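
The sorted-entry alternative can be sketched with the Python standard library's bisect module. The entry values below (tag code, record pointer) are made up for illustration, and find_pointers is a hypothetical helper name.

```python
import bisect


def find_pointers(codes, pointers, sample_code):
    """Binary-search sorted tag codes and return every pointer for sample_code."""
    left = bisect.bisect_left(codes, sample_code)    # first entry with this code
    right = bisect.bisect_right(codes, sample_code)  # one past the last such entry
    return pointers[left:right]


# Each entry is (tag code, pointer to a database record); values are illustrative.
entries = sorted([(27, 102), (5, 7), (27, 3), (900, 44)])
codes = [code for code, _ in entries]
pointers = [pointer for _, pointer in entries]

print(find_pointers(codes, pointers, 27))
print(find_pointers(codes, pointers, 6))
```

Keeping the codes and pointers in parallel lists lets the search compare plain integers; a tree or linked list, as the text notes, would serve the same purpose.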

[0055] In the preferred embodiment, each entry of the pointer file 12 comprises only a pointer to a corresponding
record in the sequence database 10. In building the pointer file 12, each sequence tag code is assigned to an entry
position in the pointer file 12 corresponding to the value of the tag code. For example, if a sequence tag code was
"1043", a pointer to the corresponding record in the sequence database 10 would be stored in entry #1043 of the
pointer file 12. The value of a sample tag code X can be used to directly address the location in the pointer file 12 that
corresponds to the sample tag code X, and thus rapidly access the pointer stored in that location in order to address
the sequence database 10.
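
Direct addressing can be sketched as a list whose index is the tag code, so that a lookup is a single indexing operation rather than a search. The sketch assumes base-4 tag codes and a short 4-base tag to keep the table small; build_direct_table, lookup and the sample pointers are hypothetical, and a later record with the same code simply overwrites an earlier one here.

```python
def build_direct_table(coded_records, tag_length):
    """Create a table with one slot per possible tag code (4 ** tag_length slots).

    coded_records is an iterable of (tag_code, record_pointer) pairs.
    Slots with no matching record stay None.
    """
    table = [None] * (4 ** tag_length)
    for tag_code, record_pointer in coded_records:
        table[tag_code] = record_pointer  # entry position equals the tag code value
    return table


def lookup(table, sample_code):
    """Directly address the table with the sample tag code."""
    return table[sample_code]


pointer_table = build_direct_table([(27, 102), (5, 7), (255, 44)], tag_length=4)
print(len(pointer_table), lookup(pointer_table, 27), lookup(pointer_table, 6))
```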

[0056] Because only four values are needed to represent all possible base pairs, using binary coded decimal (BCD)
numbers for tag codes in conjunction with the preferred pointer file 12 structure leads to a "sparse" pointer file 12 that
wastes memory or storage space. Accordingly, the present invention transforms each tag code to number base 4 (i.e.,
2 bits per code digit), in known fashion, resulting in a compact pointer file 12 structure. For example, for tag sequence
"ACGT", with A=00, C=01, G=10, T=11 (binary), the base four representation in binary would be "00011011". In contrast,
the BCD representation would be "00000000 00000001 00000010 00000011". Of course, it should be understood
that other mappings of base pairs to codes would provide equivalent function.
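
The two representations can be compared with a short sketch. Under the mapping A=00, C=01, G=10, T=11, the bit string 00011011 corresponds to the base order A, C, G, T, which is the order used below. The function names are hypothetical, and decimal_addressed_entries reflects only one reading of the "sparse" case (each base contributing one decimal digit of the entry address), which is an assumption made here.

```python
BASE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def two_bit_string(tag):
    """Base-4 representation: two bits per base."""
    return "".join(BASE_BITS[base] for base in tag.upper())


def byte_per_digit_string(tag):
    """One eight-bit group per code digit, as in the contrasting example."""
    return " ".join(format(int(BASE_BITS[base], 2), "08b") for base in tag.upper())


def dense_table_entries(tag_length):
    """Number of pointer entries when codes are base-4 numbers."""
    return 4 ** tag_length


def decimal_addressed_entries(tag_length):
    """Entries needed if the digits 0-3 were read as a decimal address (assumed reading)."""
    return int("3" * tag_length) + 1


print(two_bit_string("ACGT"))
print(byte_per_digit_string("ACGT"))
print(dense_table_entries(9), decimal_addressed_entries(9))
```

For a 9-base tag the base-4 table has 262,144 entries, all of which can correspond to a real tag, whereas the decimal-addressed reading would leave most entry positions unused.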

[0057] The concept of deriving a defined tag from a sample sequence in accordance with the present invention is
also useful in comparing different samples for similarity. In the preferred embodiment, a computer method is used to
match sequence tags from different samples. For example, in comparing materials having a large number of sequences
(e.g., tissue), the frequency of occurrence of the various tags in a first sample can be mapped out as tag codes stored
in a distribution or histogram-type data structure. For example, a table structured similar to pointer file 12 in FIGURE
4 can be used where each entry comprises a frequency of occurrence value. Thereafter, the various tags in a second
sample can be generated, converted to tag codes, and compared to the table by directly addressing table entries with
the tag code. A count can be kept of the number of matches found, as well as the location of the matches, for output
in text or graphic form on an output device, and/or for storage in a data storage system for later use.
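
The sample-to-sample comparison can be sketched with a frequency table keyed by tag code. This is an illustration under stated assumptions: the A=0, C=1, G=2, T=3 mapping, a collections.Counter standing in for the histogram-type structure, and hypothetical function names and sample tags.

```python
from collections import Counter

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def tag_to_code(tag):
    """Read a tag as a base-4 number."""
    code = 0
    for base in tag.upper():
        code = code * 4 + BASE_CODE[base]
    return code


def tag_histogram(tags):
    """Frequency of occurrence of each tag code in a sample."""
    return Counter(tag_to_code(tag) for tag in tags)


def compare_samples(first_tags, second_tags):
    """Return (tag, tag code, frequency in first sample) for each second-sample tag found."""
    table = tag_histogram(first_tags)
    matches = []
    for tag in second_tags:
        code = tag_to_code(tag)
        if table[code] > 0:  # direct addressing of the table by tag code
            matches.append((tag, code, table[code]))
    return matches


first_sample = ["ACGT", "ACGT", "TTTT"]
second_sample = ["ACGT", "GGGG", "TTTT"]
matches = compare_samples(first_sample, second_sample)
print(len(matches), matches)
```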

[0058] The tag comparison aspects of the invention may be implemented in hardware or software, or a combination
of both. Preferably, these aspects of the invention are implemented in computer programs executing on a programmable
computer comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage
elements), at least one input device, and at least one output device. Data input through one or more input devices for
temporary or permanent storage in the data storage system includes sequences, and may include previously generated
tags and tag codes for known and/or unknown sequences. Program code is applied to the input data to perform the
functions described above and generate output information. The output information is applied to one or more output
devices, in known fashion.

[0059] Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic
diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer
when the storage medium or device is read by the computer to perform the procedures described herein. The inventive
system may also be considered to be implemented as a computer-readable storage medium, configured with a computer
program, where the storage medium so configured causes a computer to operate in a specific and predefined
manner to perform the functions described herein.

[0060] The following examples are intended to illustrate but not limit the invention. While they are typical of those
that might be used, other procedures known to those skilled in the art may alternatively be used.

EXAMPLES 

[0061] For exemplary purposes, the SAGE method of the invention was used to characterize gene expression in
the human pancreas. NlaIII was utilized as the first restriction endonuclease, or anchoring enzyme, and BsmFI as the
second restriction endonuclease, or tagging enzyme, yielding a 9 bp tag. BsmFI was predicted to cleave the
complementary strand 14 bp 3' to the recognition site GGGAC and to yield a 4 bp 5' overhang (New England BioLabs).
Overlapping the BsmFI and NlaIII (CATG) sites as indicated (GGGACATG) would be predicted to result in an 11 bp tag.
However, analysis suggested that under the cleavage conditions used (37°C), BsmFI often cleaved closer to its
recognition site, leaving a minimum of 12 bp 3' of its recognition site. Therefore, only the 9 bp closest to the anchoring
enzyme site were used for analysis of tags. Cleavage at 65°C results in a more consistent 11 bp tag.
[0062] Computer analysis of human transcripts from GenBank indicated that greater than 95% of tags of 9 bp in
length were likely to be unique and that inclusion of two additional bases provided little additional resolution. Human
sequences (84,300) were extracted from the GenBank 87 database using the Findseq program provided on the
IntelliGenetics Bionet on-line service. All further analysis was performed with a SAGE program group written in Microsoft
Visual Basic for the Microsoft Windows operating system. The SAGE database analysis program was set to include
only sequences noted as "RNA" in the locus description and to exclude entries noted as "EST", resulting in a reduction
to 13,241 sequences. Analysis of this subset of sequences using NlaIII as anchoring enzyme indicated that 4,127 nine
bp tags were unique while 1,511 tags were found in more than one entry. Nucleotide comparison of a randomly chosen
subset (100) of the latter entries indicated that at least 83% were due to redundant data base entries for the same
gene or highly related genes (>95% identity over at least 250 bp). This suggested that 5381 of the 9 bp tags (95.5%)
were unique to a transcript or highly conserved transcript family. Likewise, analysis of the same subset of GenBank
with an 11 bp tag resulted only in a 6% decrease in repeated tags (1511 to 1425) instead of the 94% decrease expected
if the repeated tags were due to unrelated transcripts (two additional bases would reduce chance matches 16-fold).

EXAMPLE 1

[0063] As outlined above, mRNA from human pancreas was used to generate ditags. Briefly, five µg mRNA from
total pancreas (Clontech) was converted to double stranded cDNA using a BRL cDNA synthesis kit following the
manufacturer's protocol, using the primer biotin-5'T18-3'. The cDNA was then cleaved with NlaIII and the 3' restriction
fragments isolated by binding to magnetic streptavidin beads (Dynal). The bound DNA was divided into two pools, and
one of the following linkers ligated to each pool:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
   3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'
(SEQ ID NO:1 and 2)

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
   3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'
(SEQ ID NO:3 and 4)

where A is a dideoxy nucleotide (e.g., dideoxy A).

[0064] After extensive washing to remove unligated linkers, the linkers and adjacent tags were released by cleavage
with BsmFI. The resulting overhangs were filled in with T4 polymerase and the pools combined and ligated to each
other. The desired ligation product was then amplified for 25 cycles using 5'-CCAGCTTATTCAATTCGGTCC-3' and
5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:5 and 6, respectively) as primers. The PCR reaction was then
analyzed by polyacrylamide gel electrophoresis and the desired product excised. An additional 15 cycles of PCR were
then performed to generate sufficient product for efficient ligation and cloning.

[0065] The PCR ditag products were cleaved with NlaIII and the band containing the ditags was excised and
self-ligated. After ligation, the concatenated ditags were separated by polyacrylamide gel electrophoresis and products
greater than 200 bp were excised. These products were cloned into the SphI site of pSL301 (Invitrogen). Colonies
were screened for inserts by PCR using T7 and T3 sequences outside the cloning site as primers. Clones containing
at least 10 tags (range 10 to 50 tags) were identified by PCR amplification and manually sequenced as described (Del
Sal, et al., Biotechniques 7:514, 1989) using 5'-GACGTCGACCTGAGGTAATTATAACC-3' (SEQ ID NO:7) as primer.
Sequence files were analyzed using the SAGE software group, which identifies the anchoring enzyme site with the
proper spacing, extracts the two intervening tags, and records them in a database. The 1,000 tags were derived
from 413 unique ditags and 87 repeated ditags. The latter were only counted once to eliminate potential PCR bias of
the quantitation. The function of SAGE software is merely to optimize the search for gene sequences.
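
The tag extraction step can be illustrated with a short sketch. It is not the SAGE software group itself: the function names, the 18 to 22 base spacing window (borrowed from the ditag length recited in the claims) and the made-up read are assumptions. The sketch relies on CATG being its own reverse complement, so the second tag of each ditag is read from the opposite strand.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def extract_tags(read, anchor="CATG", tag_len=9, min_len=18, max_len=22):
    """Extract the two tags of each ditag lying between adjacent anchoring sites.

    Ditags outside the spacing window are skipped, and a ditag seen more than
    once in the read is counted a single time.
    """
    sites = [match.start() for match in re.finditer(anchor, read)]
    seen = set()
    tags = []
    for left, right in zip(sites, sites[1:]):
        ditag = read[left + len(anchor):right]
        if not min_len <= len(ditag) <= max_len:
            continue  # improper spacing between anchoring sites
        if ditag in seen:
            continue  # repeated ditag, counted once
        seen.add(ditag)
        tags.append(ditag[:tag_len])                        # tag next to the left site
        tags.append(reverse_complement(ditag[-tag_len:]))   # tag next to the right site
    return tags


# Made-up concatemer read: the same 20-base ditag twice between CATG sites.
ditag = "GAGCACACC" + "AT" + reverse_complement("TTCTGTGTG")
read = "CATG" + ditag + "CATG" + ditag + "CATG"
print(extract_tags(read))
```

A fuller treatment would also recognize a ditag and its reverse complement as the same ditag; that step is omitted here for brevity.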
[0066] Table 1 shows analysis of the first 1,000 tags. Sixteen percent were eliminated because they either had
sequence ambiguities or were derived from linker sequences. The remaining 840 tags included 351 tags that occurred
once and 77 tags that were found multiple times. Nine of the ten most abundant tags matched at least one entry in
GenBank R87. The remaining tag was subsequently shown to be derived from amylase. All ten transcripts were derived
from genes of known pancreatic function and their prevalence was consistent with previous analyses of pancreatic
RNA using conventional approaches (Han, et al., Proc. Natl. Acad. Sci. USA, 83:110, 1986; Takeda, et al., Hum. Mol.
Gen., 2:1793, 1993).

TABLE 1
Pancreatic SAGE Tags

Tag         Gene                                             N     Percent
GAGCACACC   Procarboxypeptidase A1 (X67318)                  64    7.6
TTCTGTGTG   Pancreatic Trypsinogen 2 (M27602)                46    5.5
GAACACAAA   Chymotrypsinogen (M24400)                        37    4.4
TCAGGGTGA   Pancreatic Trypsin 1 (M22612)                    31    3.7
GCGTGACCA   Elastase IIIB (M18692)                           20    2.4
GTGTGTGCT   Protease E (D00306)                              16    1.9
TCATTGGCC   Pancreatic Lipase (M93285)                       16    1.9
CCAGAGAGT   Procarboxypeptidase B (M81057)                   14    1.7
TCCTCAAAA   No Match, See Table 2, P1                        14    1.7
AGCCTTGGT   Bile Salt Stimulated Lipase (X54457)             12    1.4
GTGTGCGCT   No Match                                         11    1.3
TGCGAGACC   No Match, See Table 2, P2                         9    1.1
GTGAAACCC   21 Alu entries                                    8    1.0
GGTGACTCT   No Match                                          8    1.0
AAGGTAACA   Secretory Trypsin Inhibitor (M11949)              6    0.7
TCCCCTGTG   No Match                                          5    0.6
GTGACCACG   No Match                                          5    0.6
CCTGTAATC   M91159, M29366, lt Alu entries                    5    0.6
CACGTTGGA   No Match                                          5    0.6
AGCCCTACA   No Match                                          5    0.6
AGCACCTCC   Elongation Factor 2 (Z11692)                      5    0.6
ACGCAGGGA   No Match, See Table 2, P3                         5    0.6
AATTGAAGA   No Match, See Table 2, P4                         5    0.6
TTCTGTGGG   No Match                                          4    0.5
TTCATACAC   No Match                                          4    0.5
GTGGCAGGC   NF-kB (X61499), Alu entry (S94541)                4    0.5
GTAAAACCC   TNF receptor II (M55994), Alu entry (JC01448)     4    0.5
GAACACACA   No Match                                          4    0.5
CCTGGGAAG   Pancreatic Mucin (J05582)                         4    0.5
CCCATCGTC   Mitochondrial CytC Oxidase (X15759)               4    0.5
(SEQ ID NO:8-37)

Summary
SAGE tags occurring:
  Greater than three times                                  380   45.2
  Three times (15 x 3 =)                                     45    5.4
  Two times (32 x 2 =)                                       64    7.6
  Once                                                      351   41.8
Total SAGE Tags                                             840  100.0

[0067] "Tag" indicates the 9 bp sequence unique to each tag, adjacent to the 4 bp anchoring NlaIII site. "N" and
"Percent" indicate the number of times the tag was identified and its frequency, respectively. "Gene" indicates the
accession number and description of GenBank R87 entries found to match the indicated tag using the SAGE software
group with the following exceptions. When multiple entries were identified because of duplicated entries, only one entry
is listed. In the cases of chymotrypsinogen and trypsinogen 1, other genes were identified that were predicted to
contain the same tags, but subsequent hybridization and sequence analysis identified the listed genes as the source
of the tags. "Alu entry" indicates a match with a GenBank entry for a transcript that contained at least one copy of the
Alu consensus sequence (Deininger, et al., J. Mol. Biol., 151:17, 1981).

EXAMPLE 2

[0068] The quantitative nature of SAGE was evaluated by construction of an oligo-dT primed pancreatic cDNA library
which was screened with cDNA probes for trypsinogen 1/2, procarboxypeptidase A1, chymotrypsinogen and elastase
IIIB/protease E. Pancreatic mRNA from the same preparation as used for SAGE in Example 1 was used to construct
a cDNA library in the ZAP Express vector using the ZAP Express cDNA Synthesis kit following the manufacturer's
protocol (Stratagene). Analysis of 15 randomly selected clones indicated that 100% contained cDNA inserts. Plates
containing 250 to 500 plaques were hybridized as previously described (Ruppert, et al., Mol. Cell. Biol., 8:3104, 1988).
cDNA probes for trypsinogen 1, trypsinogen 2, procarboxypeptidase A1, chymotrypsinogen, and elastase IIIB were
derived by RT-PCR from pancreas RNA. The trypsinogen 1 and 2 probes were 93% identical and hybridized to the
same plaques under the conditions used. Likewise, the elastase IIIB probe and protease E probe were over 95%
identical and hybridized to the same plaques.

[0069] The relative abundance of the SAGE tags for these transcripts was in excellent agreement with the results
obtained with library screening (Figure 2). Furthermore, whereas neither trypsinogen 1 and 2 nor elastase IIIB and
protease E could be distinguished by the cDNA probes used to screen the library, all four transcripts could readily be
distinguished on the basis of their SAGE tags (Table 1).

EXAMPLE 3 

[0070] In addition to providing quantitative information on the abundance of known transcripts, SAGE could be used
to identify novel expressed genes. While for the purposes of the SAGE analysis in this example, only the 9 bp sequence
unique to each transcript was considered, each SAGE tag defined a 13 bp sequence composed of the anchoring
enzyme (4 bp) site plus the 9 bp tag. To illustrate this potential, 13 bp oligonucleotides were used to isolate the transcripts
corresponding to four unassigned tags (P1 to P4), that is, tags without corresponding entries from GenBank R87 (Table
1). In each of the four cases, it was possible to isolate multiple cDNA clones for the tag by simply screening the
pancreatic cDNA library using a 13 bp oligonucleotide as hybridization probe (examples in Figure 3).
[0071] Plates containing 250 to 2,000 plaques were hybridized to oligonucleotide probes using the same conditions
previously described for standard probes except that the hybridization temperature was reduced to room temperature.
Washes were performed in 6xSSC/0.1% SDS for 30 minutes at room temperature. The probes consisted of 13 bp
oligonucleotides which were labeled with γ-32P-ATP using T4 polynucleotide kinase. In each case, sequencing of the
derived clones identified the correct SAGE tag at the predicted 3' end of the identified transcript. The abundance of
plaques identified by hybridization with the 13-mers was in good agreement with that predicted by SAGE (Table 2).
Tags P1 and P2 were found to correspond to amylase and preprocarboxypeptidase A2, respectively. No entry for
preprocarboxypeptidase A2 and only a truncated entry for amylase was present in GenBank R87, thus accounting for
their unassigned characterization. Tag P3 did not match any genes of known function in GenBank but did match
numerous ESTs, providing further evidence that it represented a bona fide transcript. The cDNA identified by P4 showed
no significant homology, suggesting that it represented a previously uncharacterized pancreatic transcript.

TABLE 2

Tag                            SAGE Abundance   13mer Hyb         SAGE Tag   Description
P1 TCCTCAAAA (SEQ ID NO:38)    1.7%             [not legible]     +          3' end of Pancreatic Amylase (M28443)
P2 TGCGAGACC (SEQ ID NO:39)    1.1%             1.2% (43/3700)    +          3' end of Preprocarboxypeptidase A2 (U19977)
P3 ACGCAGGGA (SEQ ID NO:40)    0.6%             0.2% (5/2772)     +          EST match (R45808)
P4 AATTGAAGA (SEQ ID NO:41)    0.6%             0.4% (6/1587)     +          no match

[0072] "Tag" and "SAGE Abundance" are described in Table 1; "13mer Hyb" indicates the results obtained by
screening a cDNA library with a 13mer, as described above. The number of positive plaques divided by the total plaques
screened is indicated in parentheses following the percent abundance. A positive in the "SAGE Tag" column indicates
that the expected SAGE tag sequence was identified near the 3' end of isolated clones. "Description" indicates the
results of BLAST searches of the daily updated GenBank entries at NCBI as of 6/9/95 (Altschul, et al., J. Mol. Biol.,
215:403, 1990). A description and accession number are given for the most significant matches. P1 was found to match
a truncated entry for amylase, and P2 was found to match an unpublished entry for preprocarboxypeptidase A2 which
was entered after GenBank R87.



EXAMPLE 4

[0073] Ditags produced by SAGE can be analyzed by PSA or CS, as described in the specification. In a preferred
embodiment of PSA, the following steps are carried out with ditags:

Ditags are prepared, amplified and cleaved with the anchoring enzyme as described in the previous examples.

Four-base oligomers containing an identifier (e.g., a fluorescent moiety, FL) are prepared that are complementary
to the overhangs, for example, FL-CATG. The FL-CATG oligomers (in excess) are ligated to the ditags as shown
below:

           OOOOOOOOOOXXXXXXXXXXCATG-3'
    3'-GTACOOOOOOOOOOXXXXXXXXXX

    5'-FL-CATGOOOOOOOOOOXXXXXXXXXXCATG
          GTACOOOOOOOOOOXXXXXXXXXXGTAC-FL-5'

The ditags are then purified and melted to yield single-stranded DNAs having the formula:

    5'-FL-CATGOOOOOOOOOOXXXXXXXXXXCATG

and

    GTACOOOOOOOOOOXXXXXXXXXXGTAC-FL-5'

for example. The mixture of single-stranded DNAs is preferably serially diluted.

[0074] Each serial dilution is hybridized under appropriate stringency conditions with solid matrices containing
gridded single-stranded oligonucleotides; all of the oligonucleotides contain a half-site of the anchoring enzyme cleavage
sequence. In the example used herein, the oligonucleotide sequences contain a CATG sequence at the 5' end:

    CATGOOOOOOOOOO, CATGXXXXXXXXXX, etc.

(or alternatively a CATG sequence at the 3' end: OOOOOOOOOOCATG).

[0075] The matrices can be constructed of any material known in the art and the oligonucleotide-bearing chips can
be generated by any procedure known in the art, e.g., silicon chips containing oligonucleotides prepared by the VLSIP
procedure (Fodor et al., supra).

[0076] The oligonucleotide-bearing matrices are evaluated for the presence or absence of a fluorescent ditag at each
position in the grid.

[0077] In a preferred embodiment, there are 4^10, or 1,048,576, oligonucleotides on the grid(s) of the general sequence
CATGOOOOOOOOOO, such that every possible 10-base sequence is represented 3' to the CATG, where CATG is
used as an example of an anchoring enzyme half site that is complementary to the anchoring enzyme half site at the
3' end of the ditag. Since there are estimated to be no more than 100,000 to 200,000 different expressed genes in the
human genome, there are enough oligonucleotide sequences to detect all of the possible sequences adjacent to the
3'-most anchoring enzyme site observed in the cDNAs from the expressed genes in the human genome.
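
The size of such a grid, and one possible way of assigning a sequence to each grid position, can be sketched as follows. The indexing scheme (grid position read as a base-4 number with A, C, G, T as digits 0 to 3) and the function name grid_oligo are assumptions made for illustration; the document does not specify how positions are assigned.

```python
BASES = "ACGT"


def grid_oligo(index, n=10, anchor="CATG"):
    """Return the oligonucleotide for a grid position: anchor half site plus an n-mer.

    The n-mer is the base-4 expansion of index, most significant digit first.
    """
    if not 0 <= index < 4 ** n:
        raise ValueError("index outside the grid")
    digits = []
    for _ in range(n):
        index, remainder = divmod(index, 4)
        digits.append(BASES[remainder])
    return anchor + "".join(reversed(digits))


grid_size = 4 ** 10
print(grid_size)
print(grid_oligo(0), grid_oligo(27), grid_oligo(grid_size - 1))
```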
[0078] In yet another embodiment, structures as described above containing the sequences

    PRIMER A-GGAGCATG(X)10(O)10CATGCATCC-PRIMER B
    PRIMER A-CCTCGTAC(X)10(O)10GTACGTAGG-PRIMER B

are amplified, cleaved with tagging enzyme and thereafter with anchoring enzyme to generate tag complements of the
structure: (O)10CATG-3', which can then be labeled, melted, and hybridized with oligonucleotides on a solid support.
[0079] A determination is made of differential expression by comparing the fluorescence profile on the grids at
different dilutions among different libraries (representing differential screening probes). For example:

[Figure: hybridization grids for Library A and Library B ditags, each diluted 1:10, 1:50 and 1:100]

[0080] The individual oligonucleotides thus hybridize to ditags with the following characteristics:

Table 3

              Dilution
              1:10            1:50            1:100
              Lib A   Lib B   Lib A   Lib B   Lib A   Lib B
[Table rows (grid positions such as 3B, scored "+" per library and dilution) are not recoverable from the source]

[0081] Table 3 summarizes the results of the differential hybridization. Tags hybridizing to 1A and 3B reflect highly
abundant mRNAs that are not differentially expressed (since the tags hybridize to both libraries at all dilutions); tag 2C
identifies a highly abundant mRNA, but only in Library B. 2E reflects a low abundance transcript (since it is only detected
at the lowest dilution) that is not found to be differentially expressed; 3C reflects a moderately abundant transcript
(since it is expressed at the lower two dilutions) in Library B that is expressed at low abundance in Library A. 4D reflects
a differentially-expressed, high abundance transcript restricted to Library A; 5A reflects a transcript that is expressed
at high abundance in Library A but only at low abundance in Library B; and 5E reflects a differentially-expressed
transcript that is detectable only in Library B. In another PSA embodiment, step 3 above does not involve the use of a
fluorescent or other identifier; instead, at the last round of amplification of the ditags, labeled dNTPs are used so that
after melting, half of all molecules are labeled and can serve as probes for hybridization to oligonucleotides fixed on
the chips.

[0082] For use in clonal sequencing, ditags or concatemers would be diluted and added to wells of multiwell plates,
for example, or other receptacles so that on average the wells would contain, statistically, less than one DNA molecule
per well (as is done in limiting dilution for cell cloning). Each well would then receive reagents for PCR or another
amplification process and the DNA in each receptacle would be sequenced, e.g., by mass spectroscopy. The results
will either be a single sequence (there having been a single sequence in that receptacle), a "null" sequence (no DNA
present) or a double sequence (more than one DNA molecule), which would be eliminated from consideration during
data analysis. Thereafter, assessment of differential expression would be the same as described herein.
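
The three outcomes per well can be estimated with a simple model. The sketch below assumes that the number of DNA molecules per well follows a Poisson distribution with the chosen mean; that model, the function name well_fractions and the mean of 0.5 molecules per well are assumptions introduced here and are not stated in this document.

```python
import math


def well_fractions(mean_molecules_per_well):
    """Return the expected fractions of (null, single, multiple) wells.

    Assumes molecule counts per well are Poisson distributed.
    """
    lam = mean_molecules_per_well
    null = math.exp(-lam)            # no DNA present
    single = lam * math.exp(-lam)    # exactly one molecule: usable sequence
    multiple = 1.0 - null - single   # more than one molecule: discarded
    return null, single, multiple


null, single, multiple = well_fractions(0.5)
print(round(null, 3), round(single, 3), round(multiple, 3))
```

Under this model, lowering the mean reduces the fraction of wells that must be discarded as double sequences, at the cost of more empty wells.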
[0083] These results demonstrate that SAGE provides both quantitative and qualitative data about gene expression.
The use of different anchoring enzymes and/or tagging enzymes with various recognition elements lends great flexibility
to this strategy. In particular, since different anchoring enzymes cleave cDNA at different sites, the use of at least 2
different AEs on different samples of the same cDNA preparation allows confirmation of results and analysis of
sequences that might not contain a recognition site for one of the enzymes.

[0084] As efforts to fully characterize the genome near completion, SAGE should allow a direct readout of expression
in any given cell type or tissue. In the interim, a major application of SAGE will be the comparison of gene expression
patterns among tissues and in various developmental and disease states in a given cell or tissue. One of skill in the
art with the capability to perform PCR and manual sequencing could perform SAGE for this purpose. Adaptation of
this technique to an automated sequencer would allow the analysis of over 1,000 transcripts in a single 3 hour run. An
ABI 377 sequencer can produce a 451 bp readout for 36 templates in a 3 hour run (451 bp/11 bp per tag x 36 = 1476
tags). The appropriate number of tags to be determined will depend on the application. For example, the definition of
genes expressed at relatively high levels (0.5% or more) in one tissue, but low in another, would require only a single
day. Determination of transcripts expressed at greater than 100 mRNAs per cell (0.025% or more) should be quantifiable
within a few months by a single investigator. Use of two different anchoring enzymes will ensure that virtually all
transcripts of the desired abundance will be identified. The genes encoding those tags found to be most interesting on
the basis of their differential representation can be positively identified by a combination of data-base searching,
hybridization, and sequence analysis as demonstrated in Table 2. Obviously, SAGE could also be applied to the analysis
of organisms other than humans, and could direct investigation towards genes expressed in specific biologic states.
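
The throughput figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative. The 11 bp per tag is consistent with two 9 bp tags sharing one 4 bp anchoring site per ditag ((9 + 9 + 4) / 2 = 11), although that reading is an inference and is not stated in the text.

```python
READ_LENGTH_BP = 451       # readout per template
BP_PER_TAG = 11            # sequence consumed per tag
TEMPLATES_PER_RUN = 36     # templates per 3 hour run

tags_per_template = READ_LENGTH_BP // BP_PER_TAG
tags_per_run = tags_per_template * TEMPLATES_PER_RUN

# Total mRNAs per cell implied by "100 mRNAs per cell (0.025% or more)"
copies_per_cell = 100
abundance_percent = 0.025
implied_mrnas_per_cell = round(copies_per_cell / (abundance_percent / 100))

print(tags_per_template, tags_per_run, implied_mrnas_per_cell)
```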

[0085] SAGE, as described herein, allows comparison of expression of numerous genes among tissues or among
different states of development of the same tissue, or between pathologic tissue and its normal counterpart. Such
analysis is useful for identifying therapeutically, diagnostically and prognostically relevant genes, for example. Among
the many utilities for SAGE technology is the identification of appropriate antisense or triple helix reagents which may
be therapeutically useful. Further, gene therapy candidates can also be identified by the SAGE technology. Other uses
include diagnostic applications for identification of individual genes or groups of genes whose expression is shown to
correlate to predisposition to disease, the presence of disease, and prognosis of disease, for example. An abundance
profile, such as that depicted in Table 1, is useful for the above described applications. SAGE is also useful for detection
of an organism (e.g., a pathogen) in a host or detection of infection-specific genes expressed by a pathogen in a host.

[0086] The ability to identify a large number of expressed genes in a short period of time, as described by SAGE in
the present invention, provides unlimited uses.

[0087] Although the invention has been described with reference to the presently preferred embodiment, it should
be understood that various modifications can be made without departing from the spirit of the invention. Accordingly,
the invention is limited only by the following claims.



Claims 

1. An isolated ditag oligonucleotide comprising at least two defined nucleotide sequence tags, wherein the defined
nucleotide sequence tags comprise sequence 5' of a 5'-most cleavage site of a restriction endonuclease or 3' of
a 3'-most cleavage site of a restriction endonuclease in a full length cDNA, wherein each tag corresponds to an
expressed gene.


2. The composition of claim 1, wherein the oligonucleotide consists of 1 to 200 ditags.

3. The composition of claim 2, wherein the oligonucleotide consists of 8 to 20 ditags. 

4. A method for the detection of gene expression comprising:

producing complementary deoxyribonucleic acid (cDNA) oligonucleotides from mRNA of a cell which contains
an expressed gene;

isolating a first nucleotide sequence tag from a first cDNA oligonucleotide and a second nucleotide sequence
tag from a second cDNA oligonucleotide wherein the nucleotide sequence tags comprise sequence 5' of a
5'-most cleavage site of a first restriction endonuclease or 3' of a 3'-most cleavage site of a first restriction
endonuclease in a full-length cDNA;

linking the first tag to a first oligonucleotide linker, wherein the first oligonucleotide linker comprises a first
sequence for hybridisation of an amplification primer and linking the second tag to a second oligonucleotide
linker, wherein the second oligonucleotide linker comprises a second sequence for hybridisation of an
amplification primer; and

ligating the first tag linked to the first oligonucleotide linker to the second tag linked to the second
oligonucleotide linker to form a ditag;

determining the nucleotide sequence of the ditag, wherein identification of a first or second tag in a ditag
indicates that a gene which corresponds to the first or second tag is expressed in the cell.

5. The method of claim 4, further comprising amplifying the ditag oligonucleotide.

6. The method of claim 5, further comprising cleaving the ditag with the first restriction endonuclease and ligating the
cleaved ditags to form concatemers of the ditags.

7. The method of claim 6, wherein the concatemer consists of 2 to 200 ditags. 

8. The method of claim 7, wherein the concatemer consists of 8 to 20 ditags. 


9. The method of any of claims 4 to 8, wherein the first and second oligonucleotide linkers comprise the same
nucleotide sequences.

10. The method of any of claims 4 to 8, wherein the first and second oligonucleotide linkers comprise different
nucleotide sequences.

11. The method of claim 10, wherein the first and second oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
   3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
   3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5',

wherein A is dideoxy A.

12. The method of any of claims 4 to 11, wherein the linkers comprise a second restriction endonuclease recognition 
site which allows cleavage at a site distant from the recognition site. 

13. The method of claim 12, wherein the second restriction endonuclease is a type IIS endonuclease.

14. The method of claim 13, wherein the type IIS endonuclease is selected from the group consisting of BsmFI and FokI.
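
Claims 12 to 14 rely on an enzyme that cuts at a distance from its recognition site. A minimal sketch of how such a cut position could be computed follows. The recognition sequences and cut offsets used here (BsmFI: GGGAC, 10/14; FokI: GGATG, 9/13) are assumptions taken from general enzyme reference data rather than from the claims, and the function name is hypothetical.

```python
# Illustrative calculation of type IIS cut positions on a top-strand sequence.
# ASSUMPTION: recognition sites and offsets below come from general reference
# data, not from the claims. Offsets are counted from the 3' end of the site.

TYPE_IIS_SITES = {
    # enzyme: (recognition site, top-strand offset, bottom-strand offset)
    "BsmFI": ("GGGAC", 10, 14),
    "FokI": ("GGATG", 9, 13),
}


def cut_positions(top_strand: str, enzyme: str):
    """Return (top_cut, bottom_cut) as indices into `top_strand`.

    Both positions are expressed in top-strand coordinates. Returns None if
    the site is absent or the cut would fall beyond the end of the sequence.
    """
    site, top_offset, bottom_offset = TYPE_IIS_SITES[enzyme]
    start = top_strand.find(site)
    if start == -1:
        return None
    site_end = start + len(site)
    top_cut = site_end + top_offset
    bottom_cut = site_end + bottom_offset
    if bottom_cut > len(top_strand):
        return None
    return top_cut, bottom_cut


if __name__ == "__main__":
    # Linker end (GGGACATG) followed by hypothetical cDNA sequence.
    construct = "GGGACATG" + "AAACCCGGGTTTAAACCC"
    top_cut, bottom_cut = cut_positions(construct, "BsmFI")
    print(construct[:top_cut], top_cut, bottom_cut)
```

Because the cut falls a fixed distance downstream of the site carried by the linker, the released fragment carries a short stretch of cDNA whose length is set by the enzyme rather than by the cDNA sequence.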

15. The method of any of claims 4 to 14, wherein the ditag is 12 to 60 base pairs. 

16. The method of claim 15, wherein the ditag is 18 to 22 base pairs. 

17. The method of any of claims 5 to 16, wherein the amplifying is by polymerase chain reaction (PCR). 

18. The method of claim 17, wherein primers for PCR are selected from the group consisting of: 



5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.
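
The primer sequences of claim 18 can be checked against the linker sequences of claim 11. The sketch below, whose function and variable names are hypothetical, tests two things: that each primer occurs within the top strand of one linker, and that the bottom strand (written 3' to 5') pairs base by base with the top strand, leaving a single-stranded overhang at the 3' end of the top strand.

```python
# Consistency check between the linker and primer sequences listed in the claims.
# The sequences are copied from claims 11 and 18; all names are hypothetical.

LINKERS = {
    # name: (top strand 5'->3', bottom strand 3'->5')
    "linker 1": ("TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG",
                 "ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT"),
    "linker 2": ("TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG",
                 "AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT"),
}

PRIMERS = {
    "linker 1": "CCAGCTTATTCAATTCGGTCC",
    "linker 2": "GTAGACATTCTAGTATCTCGT",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the base-by-base complement (no reversal)."""
    return seq.translate(_COMPLEMENT)


def check_linker(top: str, bottom: str, primer: str):
    """Return (primer offset in the top strand, 3' overhang of the top strand).

    The overhang is None if the bottom strand does not pair with the top strand.
    """
    primer_offset = top.find(primer)
    # The bottom strand is written 3'->5', so its plain complement should
    # appear as a contiguous stretch of the top strand.
    paired_start = top.find(complement(bottom))
    if paired_start == -1:
        return primer_offset, None
    return primer_offset, top[paired_start + len(bottom):]


if __name__ == "__main__":
    for name, (top, bottom) in LINKERS.items():
        print(name, check_linker(top, bottom, PRIMERS[name]))
```

For both linkers the primer sequence lies within the top strand and the unpaired 3' end reads CATG.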



19. A method for detection of gene expression comprising: 

cleaving a cDNA sample derived from mRNA of a cell which expresses a gene with a first restriction
endonuclease, wherein the endonuclease cleaves the cDNA at a defined position at the 5' or 3' terminus of the cDNA
thereby producing defined sequence tags;

isolating a 5' or 3' cDNA tag located between the defined position and the adjacent terminus; 
ligating a first pool of tags with a first oligonucleotide linker having a first sequence useful for hybridisation to 
an amplification primer and ligating a second pool of tags with a second oligonucleotide linker having a second 
sequence useful for hybridisation to an amplification primer, wherein each primer comprises a recognition site 
for a second restriction endonuclease, wherein the second restriction endonuclease cleaves at a site distant 
from the recognition site; 

cleaving the tags with the second restriction endonuclease;
ligating the two pools of tags to produce ditags; 

determining the nucleotide sequence of a ditag, wherein identification of a first or second tag in a ditag indicates
that a gene which corresponds to the first or second tag is expressed in the cell.

20. The method of claim 19, further comprising amplifying the ditag.

21. The method of claim 20, wherein the first restriction endonuclease enzyme has a four base pair recognition site.

22. The method of claim 21, wherein the first restriction endonuclease is NlaIII.

23. The method of any of claims 19 to 22, wherein the cDNA comprises a means for capture.

24. The method of claim 23, wherein the means for capture is a binding element.

25. The method of claim 24, wherein the binding element is biotin.

26. The method of any of claims 19 to 25 wherein the first and second oligonucleotide linkers comprise the same 
nucleotide sequences. 

27. The method of any of claims 19 to 25 wherein the first and second oligonucleotide linkers comprise different
nucleotide sequences.

28. The method of claim 27, wherein the first and second oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

29. The method of any of claims 19 to 28, wherein the second restriction endonuclease is a type IIS endonuclease. 

30. The method of claim 29 wherein the type IIS endonuclease is selected from the group consisting of BsmFI and FokI.

31. The method of any of claims 19 to 30, wherein the ditag is 12 to 60 base pairs.

32. The method of claim 31, wherein the ditag is 14 to 22 base pairs.

33. The method of any of claims 19 to 32, further comprising ligating the ditags to produce a concatemer. 

34. The method of claim 33, wherein the concatemer consists of 2 to 200 ditags. 

35. The method of claim 34, wherein the concatemer consists of 8 to 20 ditags. 

36. The method of any of claims 20 to 35, wherein the amplifying is by polymerase chain reaction (PCR).

37. The method of claim 36, wherein primers for PCR are selected from the group consisting of: 



5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.



38. A kit useful for detection of gene expression wherein the presence of a cDNA ditag is indicative of expression of
a gene having a sequence of a tag of the ditag, the kit comprising a first container containing a first oligonucleotide
linker having a first sequence useful for hybridisation to an amplification primer, a second container containing a
second oligonucleotide linker having a second sequence useful for hybridisation to an amplification primer, wherein
the linkers further comprise a restriction endonuclease site for cleavage of DNA at a site distant from the restriction
endonuclease recognition site, a third and fourth container having nucleic acid primers for hybridisation to the first
and second sequences of the linker, and a fifth and a sixth container containing a ligase and, optionally, a second
restriction endonuclease which cleaves DNA at its recognition site.

39. The kit of claim 38, wherein the linkers have a sequence

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

40. The kit of claims 38 or 39, wherein the restriction endonuclease is a type IIS endonuclease. 

41. The kit of claim 40, wherein the type IIS endonuclease is BsmFI. 

42. The kit of any of claims 38 to 41, wherein the primers for amplification are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.



43. The isolated ditag oligonucleotide of claim 1 wherein the two defined nucleotide sequence tags are joined in a
tail-to-tail fashion.

44. The isolated ditag oligonucleotide of claim 1 wherein the ditags comprise cleaved cleavage sites for a restriction
endonuclease at each terminus.



Patent Claims

1. An isolated ditag oligonucleotide comprising at least two defined nucleotide sequence tags, wherein
the defined nucleotide sequence tags comprise sequence 5' of a 5'-most cleavage site of a restriction
endonuclease or 3' of a 3'-most cleavage site of a restriction endonuclease in a full-length cDNA,
wherein each tag corresponds to an expressed gene.

2. The composition of claim 1, wherein the oligonucleotide consists of 1 to 200 ditags.

3. The composition of claim 2, wherein the oligonucleotide consists of 8 to 20 ditags.

4. A method for the detection of gene expression comprising:

producing complementary deoxyribonucleic acid (cDNA) oligonucleotides from mRNA of a cell
which contains an expressed gene;

isolating a first nucleotide sequence tag from a first cDNA oligonucleotide and a second
nucleotide sequence tag from a second cDNA oligonucleotide, wherein the nucleotide sequence tags comprise
sequence 5' of a 5'-most cleavage site of a first restriction endonuclease or 3' of a 3'-most
cleavage site of a first restriction endonuclease in a full-length cDNA;

linking the first tag to a first oligonucleotide linker, wherein the first oligonucleotide linker
comprises a first sequence for hybridisation of an amplification primer, and linking the second tag
to a second oligonucleotide linker, wherein the second oligonucleotide linker comprises a second sequence for
hybridisation of an amplification primer; and

ligating the first tag linked to the first oligonucleotide linker to the second tag
linked to the second oligonucleotide linker to form a ditag;

determining the nucleotide sequence of the ditag, wherein identification of a first or second
tag in a ditag indicates that a gene which corresponds to the first or second tag is expressed in the
cell.

5. The method of claim 4, further comprising amplifying the ditag oligonucleotide.

6. The method of claim 5, further comprising cleaving the ditag with the first restriction
endonuclease and ligating the cleaved ditags to form concatemers of the ditag.

7. The method of claim 6, wherein the concatemer consists of 2 to 200 ditags.

8. The method of claim 7, wherein the concatemer consists of 8 to 20 ditags.

9. The method of any of claims 4 to 8, wherein the first and second oligonucleotide linkers comprise the same
nucleotide sequences.

10. The method of any of claims 4 to 8, wherein the first and second oligonucleotide linkers comprise different
nucleotide sequences.

11. The method of claim 10, wherein the first and second oligonucleotide linkers have the sequence

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

12. The method of any of claims 4 to 11, wherein the linkers comprise a recognition site for a second
restriction endonuclease which allows cleavage at a site distant from the recognition site.

13. The method of claim 12, wherein the second restriction endonuclease is a type IIS endonuclease.

14. The method of claim 13, wherein the type IIS endonuclease is selected from the group consisting of
BsmFI and FokI.

15. The method of any of claims 4 to 14, wherein the ditag is 12 to 60 base pairs.

16. The method of claim 15, wherein the ditag is 18 to 22 base pairs.

17. The method of any of claims 5 to 16, wherein the amplification is carried out by polymerase chain reaction (PCR).

18. The method of claim 17, wherein the primers for the PCR are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.


19. A method for the detection of gene expression comprising:

cleaving a cDNA sample, derived from mRNA of a cell which expresses a gene, with a first
restriction endonuclease, wherein the endonuclease cleaves the cDNA at a defined position at the 5' or 3' end of the
cDNA, thereby producing defined sequence tags;

isolating a 5' or 3' cDNA tag located between the defined position and the adjacent end;

ligating a first pool of tags with a first oligonucleotide linker having a first sequence
which can be used for hybridisation to an amplification primer, and ligating a second
pool of tags with a second oligonucleotide linker having a second sequence which can be used for
hybridisation to an amplification primer, wherein each primer comprises a recognition site
for a second restriction endonuclease, wherein the second restriction endonuclease cleaves at a site
distant from the recognition site;

cleaving the tags with a second restriction endonuclease;

ligating the two pools of tags to produce ditags;

determining the nucleotide sequence of a ditag, wherein identification of a first or second
tag in a ditag indicates that a gene which corresponds to the first or second tag is expressed in the
cell.

20. The method of claim 19, further comprising amplifying the ditag.

21. The method of claim 20, wherein the first restriction endonuclease has a recognition site of four base
pairs.

22. The method of claim 21, wherein the first restriction endonuclease is NlaIII.

23. The method of any of claims 19 to 22, wherein the cDNA comprises a means for capture.

24. The method of claim 23, wherein the means for capture is a binding element.

25. The method of claim 24, wherein the binding element is biotin.

26. The method of any of claims 19 to 25, wherein the first and second oligonucleotide linkers comprise the same
nucleotide sequences.

27. The method of any of claims 19 to 25, wherein the first and second oligonucleotide linkers comprise different
nucleotide sequences.

28. The method of claim 27, wherein the first and second oligonucleotide linkers have the sequence

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

29. The method of any of claims 19 to 28, wherein the second restriction endonuclease is a type IIS
endonuclease.

30. The method of claim 29, wherein the type IIS endonuclease is selected from the group comprising BsmFI
and FokI.

31. The method of any of claims 19 to 30, wherein the ditag is 12 to 60 base pairs.

32. The method of claim 31, wherein the ditag is 14 to 22 base pairs.

33. The method of any of claims 19 to 32, further comprising ligating the ditags to produce a
concatemer.

34. The method of claim 33, wherein the concatemer consists of 2 to 200 ditags.

35. The method of claim 34, wherein the concatemer consists of 8 to 20 ditags.

36. The method of any of claims 20 to 35, wherein the amplification is carried out by polymerase chain reaction (PCR).

37. The method of claim 36, wherein the primers for PCR are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.


38. A kit which can be used for the detection of gene expression, wherein the presence of a cDNA ditag
indicates expression of a gene having a sequence of a tag of the ditag, the
kit comprising: a first container containing a first oligonucleotide linker having a first sequence which can be
used for hybridisation to an amplification primer; a second container containing a second
oligonucleotide linker having a second sequence which can be used for hybridisation to an amplification primer,
wherein the linkers further comprise a restriction endonuclease site for cleavage of DNA
at a site distant from the recognition site of the restriction endonuclease; a third and a
fourth container having nucleic acid primers for hybridisation to the first and second sequences of the linkers,
respectively; and a fifth and a sixth container containing a ligase and, optionally, a second restriction
endonuclease which cleaves DNA at its recognition site.

39. The kit of claim 38, wherein the linkers have a sequence

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

40. The kit of claim 38 or 39, wherein the restriction endonuclease is a type IIS endonuclease.

41. The kit of claim 40, wherein the type IIS endonuclease is BsmFI.

42. The kit of any of claims 38 to 41, wherein the primers for amplification are selected from the group
consisting of

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.

43. The isolated ditag oligonucleotide of claim 1, wherein the two defined nucleotide sequence tags
are joined tail to tail.

44. The isolated ditag oligonucleotide of claim 1, wherein the ditags comprise cleaved
cleavage sites for a restriction endonuclease at each end.




EP 0 761 822 B1 



Claims

1. An isolated ditag oligonucleotide comprising at least two defined nucleotide sequence tags,
wherein the defined nucleotide sequence tags comprise the sequence 5' of a 5'-most cleavage site
of a restriction endonuclease or 3' of a 3'-most cleavage site of a restriction endonuclease in a
full-length cDNA, wherein each tag corresponds to an expressed gene.

2. The composition of claim 1, wherein the oligonucleotide consists of 1 to 200 ditags.

3. The composition of claim 2, wherein the oligonucleotide consists of 8 to 20 ditags.

4. A method for the detection of expression of a gene comprising:

producing complementary deoxyribonucleic acid (cDNA) oligonucleotides from mRNA
of a cell which contains an expressed gene;

isolating a first nucleotide sequence tag from a first cDNA oligonucleotide and
a second nucleotide sequence tag from a second cDNA oligonucleotide, wherein the
nucleotide sequence tags comprise the sequence 5' of a 5'-most cleavage site of a first
restriction endonuclease or 3' of a 3'-most cleavage site of a first restriction endonuclease
in a full-length cDNA;

linking the first tag to a first oligonucleotide linker, wherein the first oligonucleotide linker
comprises a first sequence for hybridisation of an amplification primer, and linking the second
tag to a second oligonucleotide linker, wherein the second oligonucleotide linker comprises a second
sequence for hybridisation of an amplification primer; and

ligating the first tag linked to the first oligonucleotide linker to the second tag linked to the second
oligonucleotide linker to form a ditag;

determining the nucleotide sequence of the ditag, wherein identification of a first or second
tag in a ditag indicates that a gene which corresponds to the first or second tag is expressed
in the cell.

5. The method of claim 4, further comprising amplifying the ditag oligonucleotide.

6. The method of claim 5, further comprising cleaving the ditag with the first restriction
endonuclease and ligating the cleaved ditags to form concatemers of the ditags.

7. The method of claim 6, wherein the concatemer consists of 2 to 200 ditags.

8. The method of claim 7, wherein the concatemer consists of 8 to 20 ditags.

9. The method of any of claims 4 to 8, wherein the first and second oligonucleotide linkers
comprise the same nucleotide sequences.

10. The method of any of claims 4 to 8, wherein the first and second oligonucleotide linkers
comprise different nucleotide sequences.

11. The method of claim 10, wherein the first and second oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

12. The method of any of claims 4 to 11, wherein the linkers comprise a recognition site
for a second restriction endonuclease which allows cleavage at a site distant from the recognition site.

13. The method of claim 12, wherein the second restriction endonuclease is a type IIS endonuclease.

14. The method of claim 13, wherein the type IIS endonuclease is selected from the group consisting of BsmFI
and FokI.

15. The method of any of claims 4 to 14, wherein the ditag is 12 to 60 base pairs.

16. The method of claim 15, wherein the ditag is 18 to 22 base pairs.

17. The method of any of claims 5 to 16, wherein the amplification is carried out by polymerase chain
reaction (PCR).

18. The method of claim 17, wherein the primers for the PCR are selected from the group consisting of:

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.

19. A method for the detection of expression of a gene comprising:

cleaving a cDNA sample derived from mRNA of a cell which expresses a gene with a first
restriction endonuclease, wherein the endonuclease cleaves the cDNA at a defined position at the 5' or 3' end of
the cDNA to produce defined sequence tags;

isolating a 5' or 3' cDNA tag located between the defined position and the adjacent end;

ligating a first pool of tags with a first oligonucleotide linker having a first
sequence useful for hybridisation to an amplification primer and ligating a second pool of
tags with a second oligonucleotide linker having a second sequence useful for hybridisation to an
amplification primer, wherein each primer comprises a recognition site for a second restriction
endonuclease, wherein the second restriction endonuclease cleaves at a site distant from the
recognition site;

cleaving the tags with the second restriction endonuclease;

ligating the two pools of tags to produce ditags;

determining the nucleotide sequence of a ditag, wherein identification of a first or second
tag in a ditag indicates that a gene which corresponds to the first or second tag is expressed
in the cell.

20. The method of claim 19, further comprising amplifying the ditag.

21. The method of claim 20, wherein the first restriction endonuclease has a four base pair
recognition site.

22. The method of claim 21, wherein the first restriction endonuclease is NlaIII.

23. The method of any of claims 19 to 22, wherein the cDNA comprises a means for capture.

24. The method of claim 23, wherein the means for capture is a binding element.

25. The method of claim 24, wherein the binding element is biotin.

26. The method of any of claims 19 to 25, wherein the first and second oligonucleotide linkers
comprise the same nucleotide sequences.

27. The method of any of claims 19 to 25, wherein the first and second oligonucleotide linkers
comprise different nucleotide sequences.

28. The method of claim 27, wherein the first and second oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

29. The method of any of claims 19 to 28, wherein the second restriction endonuclease is a
type IIS endonuclease.

30. The method of claim 29, wherein the type IIS endonuclease is selected from the group consisting of BsmFI
and FokI.

31. The method of any of claims 19 to 30, wherein the ditag is 12 to 60 base pairs.

32. The method of claim 31, wherein the ditag is 14 to 22 base pairs.

33. The method of any of claims 19 to 32, further comprising ligating the ditags to
produce a concatemer.

34. The method of claim 33, wherein the concatemer consists of 2 to 200 ditags.

35. The method of claim 34, wherein the concatemer consists of 8 to 20 ditags.

36. The method of any of claims 20 to 35, wherein the amplification is carried out by polymerase chain
reaction (PCR).

37. The method of claim 36, wherein the primers for the PCR are selected from the group consisting of:

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.

38. A kit useful for the detection of expression of a gene, wherein the presence of a cDNA ditag indicates expression
of a gene having a sequence of a tag of the ditag, the kit comprising a first container containing
a first oligonucleotide linker having a first sequence useful for hybridisation to an amplification
primer; a second container containing a second oligonucleotide linker having a second sequence
useful for hybridisation to an amplification primer, wherein the linkers further comprise a restriction endonuclease
site for cleavage of DNA at a site distant from the recognition site for the restriction endonuclease;
and a third and a fourth container having nucleic acid primers for hybridisation to the first
and second sequences of the linker, and a fifth and a sixth container containing a ligase and, optionally,
a second restriction endonuclease which cleaves DNA at its recognition site.

39. The kit of claim 38, wherein the linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'

wherein A is dideoxy A.

40. The kit of claim 38 or 39, wherein the restriction endonuclease is a type IIS endonuclease.

41. The kit of claim 40, wherein the type IIS endonuclease is BsmFI.

42. The kit of any of claims 38 to 41, wherein the primers for amplification are selected from the
group consisting of:

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.

43. The isolated ditag oligonucleotide of claim 1, wherein the two defined nucleotide sequence tags
are joined in a tail-to-tail fashion.

44. The isolated ditag oligonucleotide of claim 1, wherein the ditags comprise, at each end,
cleaved cleavage sites for a restriction endonuclease.

FIGURE 1

FIGURE 2

FIGURE 3